Syntactic Annotation for a Hittite Corpus: Problems and Principles
Abstract
This paper presents a project to build a syntactically annotated corpus of Hittite, an extinct Anatolian language written in cuneiform and the oldest Indo-European language attested in writing, which was spoken in the 18th–12th centuries BC in the territory of present-day Turkey. No publicly available corpus of Hittite with syntactic annotation exists so far, while Hittite syntax is attracting growing interest from researchers, so the need for an online annotated corpus of the language is increasingly compelling. Developing such a corpus raises several problems. Some are specific to the language itself, such as second-position (2P) clitic chains, their position in the clause in terms of generative linguistics, and the constituency structure of the Hittite clause. Others stem from sociolinguistic peculiarities of the Hittite writing system: Akkadian and Sumerian logograms were widely used by Hittite scribes and should be properly marked up in a Hittite corpus. Another problem is lacunae: the clay tablets have been heavily damaged over the last 3000–3500 years. What should the principles of phrase structure annotation be when half of the sentence is gone? The paper discusses these and other problems and principles using the material of the presented project.